
add model: Qwen3-Omni-30B-A3B-Instruct #18404

Closed
TrevorS wants to merge 7 commits into ggml-org:master from TrevorS:feature/qwen3-omni

Conversation

Contributor

TrevorS commented Dec 27, 2025

Adds Qwen3-Omni MoE support.

What's working:

What's not built and/or untested:

  • server / API
  • any hardware besides DGX Spark
  • any quants (only F16)
  • more image resolutions
  • more character voices for TTS

Examples

[attached image: animal_div32]

Prompt: describe the image in the style of a documentary

[attached audio output: documentary_test.wav]

```
main: loading model: /models/qwen3-omni/thinker-f16.gguf
WARN: This is an experimental CLI for testing multimodal capability.
      For normal use cases, please use the standard llama-cli

 Running in chat mode, available commands:
   /image <path>    load an image
   /audio <path>    load an audio
   /clear           clear the chat history
   /quit or /exit   exit the program
```

> Hello, please explain the 'transcendental aesthetic' to me.

Hello! The term **"transcendental aesthetic"** comes from the philosophy of **Immanuel Kant**, specifically from his seminal work, the *Critique of Pure Reason* (1781). It refers to the part of his philosophy that investigates the **conditions that make human experience possible**, particularly focusing on **space and time** as the **a priori forms of human sensibility**.

Let’s break it down:

### 1. **What is the "Aesthetic" in Kant’s Sense?**
In Kant’s philosophy, “aesthetic” does not mean “art” or “beauty” in the modern sense. Instead, it refers to the **faculty of sensibility**—our ability to receive impressions through the senses. The *transcendental aesthetic* is about how we **perceive** the world, not how we think about it (which is the domain of the *transcendental logic*).

### 2. **Transcendental vs. Empirical**
- **Empirical**: knowledge based on experience.
- **Transcendental**: knowledge that is **necessary and universal**, not derived from experience but **conditions** for experience to be possible.

So, the *transcendental aesthetic* asks: **What must be true about our minds in order for us to have any experience at all?**

### 3. **The Two Forms of Intuition: Space and Time**
Kant argues that **space** and **time** are not features of the external world as it exists independently of us (what he calls the *noumenal* world), nor are they concepts we learn from experience. Instead, they are **pure forms of intuition**—**a priori structures of our minds** that shape how we perceive everything.

- **Space** is the form of **outer sense**: all objects we perceive as outside us must appear to us in space.
- **Time** is the form of **inner sense**: all our inner states (thoughts, feelings, perceptions) are ordered in time.

For example:
- You can’t imagine an object without imagining it as being somewhere (in space).
- You can’t imagine any event without imagining it as happening at some point (in time).

These are not things we learn from experience—they are the **frameworks** that make experience possible.

### 4. **Why Is This Important?**
Kant’s transcendental aesthetic is crucial because it sets the stage for his entire critical philosophy. By showing that space and time are not properties of things-in-themselves but **conditions of human perception**, he limits the scope of human knowledge.

- We can only know things **as they appear to us** (*phenomena*), not as they are in themselves (*noumena*).
- This leads to his famous distinction between **phenomena** (the world as we experience it) and **noumena** (the world as it is in itself, which is unknowable).

### Summary:
The **transcendental aesthetic** is Kant’s argument that:
- Space and time are not real things in the world or concepts we learn from experience.
- They are **a priori forms of human intuition**—necessary conditions for any experience.
- Therefore, all objects of our experience must appear in space and time.
- This limits human knowledge to the realm of appearances, not to things-in-themselves.

In short: **We don’t perceive the world as it is; we perceive it as it must appear to our minds, structured by space and time.**

Let me know if you’d like a real-world analogy or further explanation!

AI Disclosure

100% of the code in this PR was written by AI.

Branch prior to clean up and rebase: https://github.com/TrevorS/llama.cpp/tree/feature/qwen3-omni-backup-20251226

I don't want to waste anyone's time -- please feel free to tell me to close my PR and go away! :)

Otherwise, I'm happy to work on doing what I need to in order to get any or all of this code merged.

- Add QWEN3OMNI_TALKER architecture enum and string mapping
- Define tensor name mappings for Talker (20-layer MoE transformer)
- Add Code Predictor tensor mappings (5 layers + 15 LM heads)
- Add Code2Wav vocoder tensor mappings (pre-transformer + upsample + decoder)
- Implement Qwen3OmniTalkerModel class with nested config extraction
- Support ModelType.TALKER for speech synthesis pipeline
- Add LLM_ARCH_QWEN3OMNI_TALKER enum and architecture name mapping
- Define Talker tensor keys (transformer, Code Predictor, Code2Wav)
- Add n_thinker_hidden to llama_hparams for cross-model coupling
- Implement qwen3omni_talker.cpp graph builder with MoE routing
- Support 20-layer Talker transformer with 128 experts per layer
- Implement 5-layer Code Predictor with 15 parallel LM heads
- Build Code2Wav vocoder graph (pre-transformer + ConvNeXt upsample + HiFiGAN decoder)
- Add CMakeLists.txt entry for qwen3omni_talker.cpp
- Implement mtmd-tts.cpp CPU inference for Talker + Code2Wav pipeline
- Add mtmd-tts-gpu.cpp CUDA-accelerated graph execution
- Implement mtmd-tts-code2wav.cpp HiFi-GAN vocoder with 16 VQ codebooks
- Support sliding window attention in Code Predictor
- Add RoPE position encoding for autoregressive code prediction
- Implement ConvNeXt upsampling and multi-resolution STFT discriminator
- Update tools/mtmd/CMakeLists.txt to build TTS components
- Implement qwen3omni-audio.cpp for 32-layer Whisper-style audio encoder
- Add Qwen3OmniMmprojModel class in convert_hf_to_gguf.py for dual encoder export
- Support audio encoder flatten + layer norm projection
- Update clip.cpp and clip-impl.h for QWEN3OMNI_AUDIO projector type
- Add audio config normalization for Whisper-style naming
- Update tools/mtmd/models/models.h with Qwen3-Omni audio model registration
- Fix whisper-enc.cpp compatibility with Qwen3-Omni audio pipeline
- Add QWEN3OMNI_VISION projector type constant
- Implement Qwen3OmniVisionMmprojModel class for vision encoder export
- Update qwen3vl-moe.cpp with vision projector support
- Add deepstack layer handling for Qwen3-VL architecture
- Update clip-model.h with QWEN3OMNI_VISION enum
- Support nested thinker.visual.* tensor prefix handling
- Wire TTS pipeline into mtmd-cli.cpp for end-to-end text→speech
- Add --tts flag for Talker model loading
- Integrate mtmd-tts.h API into mtmd.cpp
- Update mtmd-audio.cpp and mtmd-audio.h for audio encoder handling
- Add TTS tests to tools/mtmd/tests.sh
- Support both CPU and GPU inference paths
- Fix EOS token handling for Qwen3-Omni chat format
- Add GPU preparation for Code Predictor inference
- Update llama-context.cpp and llama-context.h for TTS context
- Add llama-cparams.h changes for Talker cache configuration
- Update llama-impl.cpp with Talker-specific helpers
- Add include/llama.h API extensions for TTS
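The commit list above centers on a 20-layer MoE Talker with 128 experts per layer. As a rough illustration of how top-k expert routing in such a layer works (the 128-expert count comes from the PR text; the top-k value, gating scheme, tensor shapes, and the toy experts below are illustrative assumptions, not the actual Qwen3-Omni configuration or the ggml implementation):

```python
import numpy as np

def moe_route(x, gate_w, experts, top_k=8):
    """Sketch of top-k MoE routing for a single token.

    x       : (d_model,) hidden state
    gate_w  : (n_experts, d_model) router weights
    experts : list of callables, experts[i](x) -> (d_model,)
    """
    logits = gate_w @ x                       # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                              # softmax over the selected experts only
    # weighted sum of the selected experts' outputs
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# toy usage: 128 scaling "experts", 8 routed per token
rng = np.random.default_rng(0)
d, n = 16, 128
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n, d))
experts = [(lambda i: (lambda v: v * (1 + i / n)))(i) for i in range(n)]
y = moe_route(x, gate_w, experts)
print(y.shape)  # (16,)
```

Only the selected experts are evaluated, which is the point of the MoE design: per-token compute scales with `top_k`, not with the total expert count.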
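The list also mentions RoPE position encoding for autoregressive code prediction. As a generic sketch of rotary position embedding (standard interleaved-pair RoPE; the base, dimension, and pairing convention here are assumptions, not necessarily what this PR's Code Predictor uses):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to one token at position `pos`.

    x : (d,) vector with d even; each pair (x[2i], x[2i+1]) is rotated
        by angle pos * base**(-2i/d).
    """
    d = x.shape[0]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # per-pair rotation frequency
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x0, x1 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x0 * cos - x1 * sin               # 2-D rotation of each pair
    out[1::2] = x0 * sin + x1 * cos
    return out

q = np.ones(8)
print(np.allclose(rope(q, 0), q))  # position 0 is the identity rotation
```

Because each pair is just rotated, vector norms are preserved and relative position falls out of the dot product between rotated queries and keys.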
github-actions bot added the examples, model (Model specific), and python (python script changes) labels Dec 27, 2025
Contributor

ngxson commented Dec 27, 2025

We don't accept fully AI written PRs in mtmd.

The time it takes contributors to generate such code is much less than the time it takes me to optimize and nitpick it. AI often, if not always, generates sub-optimal ggml code.

> Otherwise, I'm happy to work on doing what I need to in order to get any or all of this code merged.

Read the contribution guide: break changes down into smaller parts, smaller PRs.

For mtmd: don't use AI

Contributor Author

TrevorS commented Dec 28, 2025

closing in favor of an incremental approach, starting with #18420

TrevorS closed this Dec 28, 2025

rodial commented Jan 24, 2026

Hi @TrevorS, would it be possible to add this model to the whisper.cpp repository?
https://github.com/ggml-org/whisper.cpp

